This is true in this context, but it’s worth saying that these models seem to be trained to do well on these benchmarks, so it’s not entirely true that they are lower bounds.
The size of the “lower bound”-ness is hard to comment on, but if you can provide some input on how much improvement you think comes from tuning your harness, that would be a meaningful conclusion. That could be what we build the blog around,